Summary

This vignette introduces SAEON’s ipayipi bulk data processing pipeline for groundwater data.

  1. Install the ipayipi R package
  2. Initiate the pipeline & automate metadata consolidation in R objects (mobile app template available)
  3. Start batch processing ...


  4. Built-in data visualization with ggplot or interactive dygraphs (below) ...

Introduction

This vignette provides an overview of the groundwater data processing pipeline with some practical examples. There are three main steps in the ipayipi groundwater data processing pipeline. Each of the steps takes place in its own directory:

  1. wait_room where data is fed into the pipeline.
  2. nomvet_room where standardized native data is stored.
  3. solr_room where native data is transformed into an R format, cleaned and stored.



To create the corresponding directories, navigate to your desired working directory and use the pipe_init_gw() function.

pipe_dir  <- "data/pipeline_gw" # pipe directory

pipe_init_gw(work_dir = pipe_dir)

# check that the folders have been created
list.dirs(pipe_dir)
## [1] "data/pipeline_gw"             "data/pipeline_gw/nomvet_room" "data/pipeline_gw/solr_room"   "data/pipeline_gw/wait_room"
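For readers without the package installed, the folder scaffold that pipe_init_gw() sets up can be reproduced with base R. This is a sketch of the directory structure only (names taken from the list.dirs() output above); the real function may do more.

```r
# Sketch: recreate the pipeline's three 'rooms' with base R.
# Assumption: pipe_init_gw() does essentially this (it may also
# initialize logs or run additional checks).
init_gw_dirs <- function(work_dir) {
  rooms <- c("wait_room", "nomvet_room", "solr_room")
  dir.create(work_dir, recursive = TRUE, showWarnings = FALSE)
  for (r in rooms) dir.create(file.path(work_dir, r), showWarnings = FALSE)
  list.dirs(work_dir, recursive = FALSE)
}
```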


Importing data into the waiting room

This section of the pipeline uses three main functions:

  1. xle_import() — brings new data into the pipeline
  2. xle_nomenclature() — checks naming and file metadata
  3. xle_rename()


1. xle_import()

Only Solonist data (‘.xle’) files are processed within ipayipi at this stage. To import Solonist data files use the xle_import() function. Note that on import the file encoding is standardized to UTF-8, as is the way in which units are reported. Raw data is only copied (not cut), so your original data will be unaffected.

xle_import(
    input_dir = "data/", # raw data directory
    wait_room = "data/pipeline_gw/wait_room/", # 'wait_room' directory
    recurr = FALSE # not searching recursively in the 'input_dir'
    )
## |================ Importing 5 solonist files into the pipeline ================|
## |- CF01B-201912.xle --> Sibayi Coastal Forest_CF01B_20191115-20191212.xle------|
## |- CF01B-202001.xle --> Sibayi Coastal Forest_CF01B_20191115-20200115.xle------|
## |- CF01B-202007.xle --> Sibayi Coastal Forest_CF01B_20200116-20200707.xle------|
## |- SIBHAYI_BARO_CAMP_20180613-20200116.xle --> SIBHAYI_BARO_CAMP_20180613-20200116.xle-|
## |- SIBHAYI_BARO_CAMP_20200116-20201014.xle --> SIBHAYI_BARO_CAMP_20200116-20201014.xle-|
## |=============================== Import complete ==============================|

Note that the data files have been renamed according to their respective date ranges.


2. xle_nomenclature()

In any pipeline it is important to vet file names and metadata; without vetting, different data streams may erroneously be appended and processed together. To standardize and check how data files are named and how their metadata is stored (including spelling), this function must be run next in the pipeline.

xle_nomenclature(
    wait_room = "data/pipeline_gw/wait_room/", # 'wait_room' directory
    out_csv = TRUE
    )
## There are unconfirmed identities in the nomenclature!
## A csv file has been generated with updated nomenclature.
## Please complete the csv file to remove NAs.
##                              Orig_FileName           Location_uz Borehole_uz      SN Instrument_type Model      SDT Location Borehole b__special
## 1: SIBHAYI_BARO_CAMP_20180613-20200116.xle               SIBHAYI   BARO_CAMP 2042643         LT_EDGE  M1.5 20180613     <NA>     <NA>           
## 2:                        CF01B-201912.xle Sibayi Coastal Forest       CF01B 2095357         LT_EDGE   M10 20191115     <NA>     <NA>

The function checks file names and spelling, that is, file nomenclature, against an automatically generated nomenclature table. If there are any new names or sites, the function will generate a csv with all unique file names that have entered the data pipeline. Spelling errors and misnamed files entering the pipeline for the first time are tagged. The out_csv argument should always be set to TRUE so that this csv is generated automatically when running xle_nomenclature(). You can edit the csv file as below to standardize the file names.

The example above demonstrates the type of metadata that is extracted from the raw Solonist files (e.g. logger serial number). The ‘uz’ suffix in, for example, the “Location_uz” column designates the unstandardized site ‘Location’ of the Solonist logger. The standardized “Location” name was entered in column I of the spreadsheet editor. Only the cells/values marked with R’s missing-data placeholder NA require editing at this stage.

Once the nomenclature in the ‘nomtab_YYYYMMDD-HH:MM.csv’ has been corrected, as in the example above, the updated csv file can be imported using read_nomtab_csv(). This function will rerun xle_nomenclature() to check that naming issues have been resolved.

read_nomtab_csv(
  wait_room = "data/pipeline_gw/wait_room/",
  file = "data/nomtab_example.csv"
)
##                              Orig_FileName           Location_uz Borehole_uz      SN Instrument_type Model      SDT Location  Borehole b__special
## 1: SIBHAYI_BARO_CAMP_20180613-20200116.xle               SIBHAYI   BARO_CAMP 2042643         LT_EDGE  M1.5 20180613  SIBHAYI BARO_CAMP           
## 2:                        CF01B-201912.xle Sibayi Coastal Forest       CF01B 2095357         LT_EDGE   M10 20191115  SIBHAYI        CF

It is important to use xle_nomenclature() each time data is fed into the pipeline, but if there are no new sites or variations in spelling, editing the nomenclature table is not necessary. The nomenclature table can also be edited directly in R; however, it is safer to edit the automatically generated csv file as above.
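For the rare case where editing in R is preferred, here is a minimal base-R sketch of filling the NA cells of a nomenclature csv. The column names follow the nomtab printout above; the exact csv layout is an assumption, and real fix-up values would normally differ per row.

```r
# Sketch: fill the NA cells of the standardized columns in a
# nomenclature csv. Hypothetical helper; ipayipi's own csv
# round-trip via read_nomtab_csv() remains the safer route.
fill_nomtab <- function(csv_in, csv_out, location, borehole) {
  nomtab <- read.csv(csv_in, stringsAsFactors = FALSE)
  # only the cells marked NA require editing at this stage
  nomtab$Location[is.na(nomtab$Location)] <- location
  nomtab$Borehole[is.na(nomtab$Borehole)] <- borehole
  write.csv(nomtab, csv_out, row.names = FALSE)
  invisible(nomtab)
}
```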


3. xle_rename()

Now that the nomenclature has been vetted, the data files in the ‘waiting room’ are renamed and their internal metadata adjusted to match the pipeline’s nomenclature table.

xle_rename(
  wait_room = "data/pipeline_gw/wait_room/",
  out_csv = TRUE
)
## |==================== correcting nomenclature in 5 files... ===================|
## |-Sibayi Coastal Forest_CF01B_20191115-20191212.xle  --->>  SIBHAYI_CF_20191115-20191212.xle-|
## |-Sibayi Coastal Forest_CF01B_20191115-20200115.xle  --->>  SIBHAYI_CF_20191115-20200115.xle-|
## |-Sibayi Coastal Forest_CF01B_20200116-20200707.xle  --->>  SIBHAYI_CF_20200116-20200707.xle-|
## |-SIBHAYI_BARO_CAMP_20180613-20200116.xle  --->>  SIBHAYI_BARO_CAMP_20180613-20200116.xle-|
## |-SIBHAYI_BARO_CAMP_20200116-20201014.xle  --->>  SIBHAYI_BARO_CAMP_20200116-20201014.xle-|
## |=========================== nomenclature corrected ===========================|


Archiving standardized native groundwater data

From the ‘waiting room’, data files that have had their nomenclature and standards vetted are pushed into the ‘nomvet room’ for archiving. While transferring files, a log of which files have moved from the ‘waiting room’ to the ‘nomvet room’ is generated and stored in the ‘nomvet room’. This log is useful for future reference, i.e., to check which files are available. In addition, the log is used by the pipeline to prevent duplication of archived imports in the ‘nomvet room’. I use the vetted native Solonist files archived in the ‘nomvet room’ for sharing with collaborators who prefer using Solonist software to process their data.

1. xle_transfer()

xle_transfer() is used to copy vetted Solonist files into the ‘nomvet room’.

xle_transfer(
  wait_room = "data/pipeline_gw/wait_room/",
  nomvet_room = "data/pipeline_gw/nomvet_room/"
)
## There are no files in the nomvet folder
## |==================== moving 5 files to the nomvet room... ====================|
## |====================== arrived safely in the nomvet room =====================|
## $dups
## NULL
## 
## $copied
##     Location Borehole      SN               Start                 End                               file.name     place
## 1: BARO_CAMP  SIBHAYI 2042643 2018-06-13 15:00:00 2020-01-16 10:58:13 SIBHAYI_BARO_CAMP_20180613-20200116.xle wait_room
## 2: BARO_CAMP  SIBHAYI 2042643 2020-01-16 12:00:00 2020-10-14 16:18:42 SIBHAYI_BARO_CAMP_20200116-20201014.xle wait_room
## 3:        CF  SIBHAYI 2095357 2019-11-15 11:00:00 2019-12-12 16:45:19        SIBHAYI_CF_20191115-20191212.xle wait_room
## 4:        CF  SIBHAYI 2095357 2019-11-15 11:00:00 2020-01-15 16:36:57        SIBHAYI_CF_20191115-20200115.xle wait_room
## 5:        CF  SIBHAYI 2095357 2020-01-16 12:00:00 2020-07-07 16:12:18        SIBHAYI_CF_20200116-20200707.xle wait_room
## 
## $nomvet_log
##     Location Borehole      SN               Start                 End                               file.name       place
## 1: BARO_CAMP  SIBHAYI 2042643 2018-06-13 15:00:00 2020-01-16 10:58:13 SIBHAYI_BARO_CAMP_20180613-20200116.xle nomvet_room
## 2: BARO_CAMP  SIBHAYI 2042643 2020-01-16 12:00:00 2020-10-14 16:18:42 SIBHAYI_BARO_CAMP_20200116-20201014.xle nomvet_room
## 3:        CF  SIBHAYI 2095357 2019-11-15 11:00:00 2019-12-12 16:45:19        SIBHAYI_CF_20191115-20191212.xle nomvet_room
## 4:        CF  SIBHAYI 2095357 2019-11-15 11:00:00 2020-01-15 16:36:57        SIBHAYI_CF_20191115-20200115.xle nomvet_room
## 5:        CF  SIBHAYI 2095357 2020-01-16 12:00:00 2020-07-07 16:12:18        SIBHAYI_CF_20200116-20200707.xle nomvet_room
## 
## attr(,"class")
## [1] "SAEON_solonist"


First, the function notes that before the transfer there were no Solonist ‘xle’ files in the ‘nomvet room’. After transferring files, the function returns a list of any duplicates that were already in the ‘nomvet room’ (and therefore not copied across), a list of files transferred, and a printout of the automatically generated ‘nomvet_log’.

Processing groundwater data in R

This is where the fun starts. The data is now ready to be converted from its native web (xml) format into R. Once in R format the data can be processed into meaningful data. The data processing consists of the following steps:

  1. site_series() — appends data
  2. batch_baro_comp() — barometric compensation
  3. datum and log retrieve import
  4. outlier detection
  5. drift correction

1. site_series() — append data

Importantly, the standardized names of the files stored in the ‘nomvet room’ are used as the identifiers for appending Solonist data files. A call of site_series() will append all files in the ‘nomvet_room’ with the same file names and save them as a native R (rds) file in the Solonist R data directory, or ‘solr_room’. In the pipeline design, each time data is imported into the ‘nomvet room’, site_series() should be run. It will detect new files and append these to extant data, or, if the site is new, generate a new native R file in the ‘solr_room’. Exactly how data is appended is determined by xle_to_R() and rdta_append(); check their documentation for details and for how to use these outside of the pipeline.

site_series(
  solr_room = "data/pipeline_gw/solr_room/", # native R data
  nomvet_room = "data/pipeline_gw/nomvet_room/" # standardized solonist data
)
## |======================= batch append data + metadata... ======================|
## |----- Processing: SIBHAYI_BARO_CAMP_20180613-20200116.xle --------------------|
## |      Transfer log updated for: SIBHAYI_BARO_CAMP_20180613-20200116.xle       |
## |----- Finished: SIBHAYI_BARO_CAMP_20180613-20200116.xle ----------------------|
## |----- Processing: SIBHAYI_BARO_CAMP_20200116-20201014.xle --------------------|
## |      Transfer log updated for: SIBHAYI_BARO_CAMP_20200116-20201014.xle       |
## |----- Finished: SIBHAYI_BARO_CAMP_20200116-20201014.xle ----------------------|
## |----- Processing: SIBHAYI_CF_20191115-20191212.xle ---------------------------|
## |      Transfer log updated for: SIBHAYI_CF_20191115-20191212.xle              |
## |----- Finished: SIBHAYI_CF_20191115-20191212.xle -----------------------------|
## |----- Processing: SIBHAYI_CF_20191115-20200115.xle ---------------------------|
## |      Transfer log updated for: SIBHAYI_CF_20191115-20200115.xle              |
## |----- Finished: SIBHAYI_CF_20191115-20200115.xle -----------------------------|
## |----- Processing: SIBHAYI_CF_20200116-20200707.xle ---------------------------|
## |      Transfer log updated for: SIBHAYI_CF_20200116-20200707.xle              |
## |----- Finished: SIBHAYI_CF_20200116-20200707.xle -----------------------------|
## |=========================== data seriez appended... ==========================|
##          transfer_time                                xle_file                          from                          to
## 1: 2021-10-22 08:51:28 SIBHAYI_BARO_CAMP_20180613-20200116.xle data/pipeline_gw/nomvet_room/ data/pipeline_gw/solr_room/
## 2: 2021-10-22 08:51:29 SIBHAYI_BARO_CAMP_20200116-20201014.xle data/pipeline_gw/nomvet_room/ data/pipeline_gw/solr_room/
## 3: 2021-10-22 08:51:29        SIBHAYI_CF_20191115-20191212.xle data/pipeline_gw/nomvet_room/ data/pipeline_gw/solr_room/
## 4: 2021-10-22 08:51:29        SIBHAYI_CF_20191115-20200115.xle data/pipeline_gw/nomvet_room/ data/pipeline_gw/solr_room/
## 5: 2021-10-22 08:51:30        SIBHAYI_CF_20200116-20200707.xle data/pipeline_gw/nomvet_room/ data/pipeline_gw/solr_room/


The output messages from the function show its live progress (and are useful for debugging). Two log files are kept in the ‘solr_room’: the ‘transfer_log’ keeps a record of which files have been transferred from the ‘nomvet_room’ and is used for checking which files need appending/updating; the ‘rdta_log’ contains a catalogue of all data files in the ‘solr_room’ and related metadata, which is used further down the pipeline.
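The ‘which files need appending’ check amounts to a set difference between the file names in the ‘nomvet_room’ and those already recorded in the transfer log. A sketch of that logic, treating the log as a plain data frame with the xle_file column seen in the printout above (the real log format is an assumption):

```r
# Sketch: files present in the nomvet room but absent from the
# transfer log are the ones site_series() still needs to append.
files_to_append <- function(nomvet_files, transfer_log) {
  setdiff(nomvet_files, transfer_log$xle_file)
}
```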

2. batch_baro_comp() — barometric compensation

Water level data measured with a pressure transducer must be compensated for atmospheric pressure, which varies over time. Barometric compensation using an appropriate ‘barologger’ (barometric sensor) can be done for individual files using the baro_compensate() function. When processing bulk records in ipayipi with batch_baro_comp(), unique barologger files can be specified as below, using the full location and ‘borehole’ (or piezometer etc.) name.

batch_baro_comp() selects the appropriate barologger file for the barometric compensation using the ‘rdta_log’ in the ‘solr_room’. If a barologger for a particular sensor is not listed in the ‘rdta_log’ — follow the prompts and fill in the temporary ‘rdta_log’ csv file. The csv can be re-imported into the pipeline as a modified ‘rdta_log’ using read_rdta_log() (take care to only edit the necessary column).
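At its core the compensation is a time-matched subtraction of the barologger series from the total-pressure series. A sketch, assuming both series are already expressed as metres of water (unit conversion and the fuzzy time joins governed by join_tolerance are left to batch_baro_comp() itself):

```r
# Sketch of the core compensation step: subtract the barologger
# reading from the total-pressure reading at matching timestamps.
# Assumes both data frames have 'date_time' and 'press' columns,
# with pressures in metres of water; this is an illustration, not
# ipayipi's exact implementation.
baro_compensate_sketch <- function(level, baro) {
  m <- merge(level, baro, by = "date_time",
             suffixes = c("_tot", "_baro"))
  m$level_comp <- m$press_tot - m$press_baro
  m
}
```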

batch_baro_comp(
  input_dir = "data/pipeline_gw/solr_room/", # native R data
  out_csv = TRUE, # if a barologger has not been specified
                  # - csv output for editing
  overwrite_baro = TRUE, # if barologger data overlap will overwrite old data
  overwrite_comp = TRUE, # redo old compensation calculations 
  join_tolerance = NULL, # for fuzzy joins -- automatically set if NULL
  na_thresh = 5,
  overwrite_t_bt_level = TRUE, # overwrite old drift corrected data
  prompted = FALSE # if FALSE all files in directory will be processed
)
## |========================= barometric compensation... =========================|
## |--- Opened: SIBHAYI_CF and SIBHAYI_BARO_CAMP ---------------------------------|
## |    Compensating: SIBHAYI_CF...                                               |
## |    SIBHAYI_CF compensated                                                    |
## |==============================================================================|

3. physiCally() — metadata import

This step imports field metadata that will be used for outlier detection and drift correction. The data is extracted from csv files stored in the pipeline’s ‘solr_room’. These must be imported in a standard format (column names, units, etc.), as per the example data below.

physiCally(
  input_dir = "data/pipeline_gw/solr_room/", # native R data
  recurr = FALSE,
  overwrite_logs = TRUE,
  first_dip = TRUE,
  prompted = FALSE
)
## |====================== updating datum and retrieve logs ======================|
## |--->~ SIBHAYI_CF ~<-----------------------------------------------------------|
## |   Datum updated                                                              |
## |   Log retrieve updated                                                       |
## |======================= datum and retrieve logs updated ======================|

4. naartjie_clean() — segmented anomaly detection & interpolation

To interpolate missing and/or anomalous barometrically compensated water level data, the approach taken in this data pipeline is to segment the data into intervals defined by logger ‘disturbance’ events. Disturbance events occur when the logger is removed (or interfered with) while downloading data or conducting maintenance. After ‘disturbing’ the logger and redeploying it, ‘level shifts’ may occur that require specific level-shift anomaly detection and correction algorithms. In groundwater level processing, correction of possible level shifts associated with these ‘disturbance’ events ties in to later corrections using segmented linear drift correction based on manually acquired ‘dipper’ readings of the water level (static water readings).

Segmentation is done automatically based on metadata detailing logger disturbance events, imported with the raw Solonist logger data using xle_import() and physiCally(). Within segments defined by ‘disturbance’ events, the non-linear/non-parametric Hampel filter is applied (i.e., there are no violations of non-linear and time-series data assumptions); the ‘disturbance’ events illustrated here are marked with vertical dashed lines. Notice how at the third ‘disturbance’ event an ‘extreme’ outlier was detected (at ~49.1 m on 2017-11-09); this occurs when the logger takes a reading out of the water while it is being serviced/downloaded. The sensitivity of naartjie_clean() should be explored and adjusted for each site processed in the data pipeline.
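The within-segment filtering can be illustrated with a minimal base-R Hampel filter: flag any point that deviates from its rolling median by more than a multiple of the robust (MAD-based) standard deviation. The window size and threshold here mirror the w_size and mad_dev arguments in spirit only; the package's exact rule (e.g. ‘hf_13_3’) and its interpolation step may differ.

```r
# Minimal Hampel filter sketch: within a segment, flag points more
# than mad_dev robust standard deviations from the rolling median.
# Illustration only, not naartjie_clean()'s exact implementation.
hampel_flag <- function(x, w_size = 13, mad_dev = 3) {
  k <- w_size %/% 2
  n <- length(x)
  flag <- logical(n)
  for (i in seq_len(n)) {
    win <- x[max(1, i - k):min(n, i + k)]
    med <- median(win, na.rm = TRUE)
    # 1.4826 scales the MAD to the sd of a normal distribution
    s <- 1.4826 * median(abs(win - med), na.rm = TRUE)
    flag[i] <- isTRUE(s > 0 && abs(x[i] - med) > mad_dev * s)
  }
  flag
}
```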

naartjie_clean(
  input_dir = "data/pipeline_gw/solr_room/", # native R data
  seg_na_t = 0.75, # tolerance fraction of missing data to apply algorithm
  w_size = 13, # can train the window size for each site
  mad_dev = 3, # accepted default (increase for more wiggly data)
  last_rule = FALSE, # default to the last rule
                     # useful for bulk pipeline processing
  tighten = 0.65, # makes the rule stricter near disturbance events
                  # fraction by which to multiply 'mad_dev' within the
                  # window region of logger 'disturbance' events
  prompted = FALSE # interactive site selection
)
## |=============== non-linear anomaly detection and interpolation ===============|
## |-------------------------- segmenting SIBHAYI_CF... --------------------------|
## |  seg 1: 2019-11-15 11:00:00 --> 2019-12-12 16:30:00     | Dflt rule: hf_13_3 |
## |  seg 2: 2019-12-12 17:00:00 --> 2020-01-15 16:30:00     | Dflt rule: hf_13_3 |
## |  seg 3: 2020-01-15 17:00:00 --> 2020-01-16 12:00:00     | Too many NAs to clean segment.
## |  seg 4: 2020-01-16 12:30:00 --> 2020-02-13 17:30:00     | Dflt rule: hf_13_3 |
## |  seg 5: 2020-02-13 18:00:00 --> 2020-03-18 15:30:00     | Dflt rule: hf_13_3 |
## |  seg 6: 2020-03-18 16:00:00 --> 2020-06-04 17:00:00     | Dflt rule: hf_13_3 |
## |  seg 7: 2020-06-04 17:30:00 --> 2020-07-07 16:00:00     | Dflt rule: hf_13_3 |
## |============================= naartjies consumed =============================|

See an interactive naartjie_clean() session here…

5. driffter() — drift correction

All files, or a selected subset, can be drift corrected (linearly) using this interactive function. Up until this step, water level data has only been corrected using atmospheric pressure (and other pre-set deployment information). The manual calibrations, known as ‘dipper’ or ‘static’ water level readings, imported with the preceding function, are used for the drift correction.

Similar to naartjie_clean(), driffter() segments the data, but by ‘dipper’ reading interference events. Within each of these segments, data is ‘pivoted’ to dipper readings using linear algebra. The amount by which the data is offset from a pivot point is estimated as the difference between the ‘dipper’ reading and the average (mean or median) of a defined number of observations surrounding the pivot point (for each segment, the offset is calculated using observations within that segment only). The number of observations around the pivot point can be expanded automatically, up to a specified limit, if there is missing data. If the water level is relatively wiggly, set the number of observations (that is, the window size) smaller to avoid inaccurate offsets; future work should set the window size dynamically based on the rate of change in water level.
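The pivoting described above can be sketched in a few lines: the offset at each pivot is the dipper reading minus a robust average of the nearby logger observations, and the correction is blended linearly between the two pivots bounding a segment. Window expansion for missing data (rng_max) is omitted; this is an illustration, not driffter()'s exact implementation.

```r
# Sketch of segmented linear drift correction: estimate an offset
# at each 'dipper' pivot from the median of rng_side observations
# either side, then interpolate the correction linearly between
# the two pivots bounding the segment.
drift_correct_segment <- function(level, idx1, dip1, idx2, dip2,
                                  rng_side = 3) {
  win <- function(i) {
    level[max(1, i - rng_side):min(length(level), i + rng_side)]
  }
  off1 <- dip1 - median(win(idx1), na.rm = TRUE)
  off2 <- dip2 - median(win(idx2), na.rm = TRUE)
  # linear blend of the two offsets across the segment
  w <- (seq_along(level) - idx1) / (idx2 - idx1)
  level + off1 + w * (off2 - off1)
}
```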

The result of drift correction can be plotted using the built-in gw_vis_dy() function, e.g., here…, comparing the ‘drift neutral’ data (the series which has not been drift corrected) with the ‘t level’ (the drift-corrected series). Crosses on the plot indicate calibration measurements. Note how, through drift correction, the groundwater (or other water) level data is aligned with the height of the water level above (or below) mean sea level using the calibration measurements (plus datum and other borehole metadata).

driffter(
  input_dir = "data/pipeline_gw/solr_room/", # native R data
  rng_side = 3, # min window side size (for offset estimation)
  rng_max = 5, # max window side size to expand to if no data (for offset estimation)
  robust = TRUE, # use median (TRUE) or mean (FALSE) in offset calculations
  prompted = FALSE
)
## |=========================== linear drift correction ==========================|
## |    SIBHAYI_CF                                                                |
## |=============================== drift corrected ==============================|